1. INTRODUCTION

Recent technological developments have allowed ever more sensors to be connected to the internet of
things (IoT). The number of devices connected to the IoT was expected to exceed 20 billion by 2021, and the
data generated by these devices to reach more than 2.5 quintillion bytes per day [1]-[3]. Moreover, people's
desire to sense and collect data has grown sharply because artificial intelligence (AI) technology has improved
their ability to use data. Consequently, more and more sensor devices are deployed across applications to
sense and obtain data [4], [5]. The wireless sensor network (WSN) was originally designed for a low data rate
together with low complexity and low energy consumption. These features enable WSNs to be deployed at
large scale and at low implementation cost [6]. A WSN can be defined as a system of nodes that
collaboratively sense, monitor, capture, process, and control the inputs and outputs (in the form of data or
signals) of other systems, or that support interactions between computational systems, people, and the
surrounding environment. WSNs present a powerful integration of the distributed processes of sensing,
communication, and computation [7]-[9]. WSN applications are conceivable in almost every sector, ranging
from environmental monitoring and management, tracking and positioning, and medical and health care
services to military, industrial, and transportation applications [10]-[13].

Modern technologies have rapidly reduced the weight and size of sensor nodes, which are now
characterized by high sensing, processing, and wireless communication capabilities as well as improved
sensing accuracy. Because sensing nodes are densely spread across the target environment, the measurements
of physical phenomena in this environment are highly redundant. Therefore, while data packets are forwarded
to the sink, data aggregation is applied to multiple packets, so the volume of data after aggregation is much
smaller than the total size of the original data. This technique aims to reduce redundant data transmissions to
the upper layer and thus to enhance the overall performance of WSNs.

Several studies have used the three algorithms considered here in wireless network environments.
Ullah and Youn [14] suggested a data aggregation model for reducing network degradation and energy
depletion. After redundant data and outliers are eliminated, the self-organizing map algorithm re-clusters the
data before they are forwarded to the base station. Mittal and Kumar [15] clustered the sensed data at the
cluster head node rather than aggregating all of it there. Their method uses three algorithms, and clustering
the node values with the self-organizing map performed better than the other two algorithms. Lung and Zhou
[16] proposed the distributed hierarchical agglomerative clustering (HAC) algorithm, which groups similar
sensor nodes together and creates clusters before selecting the cluster head (CH), with the aim of providing
efficient clusters without the need for global network knowledge. Raghunandan et al. [17] also applied the
HAC algorithm to reduce some network issues and to select the CH efficiently in order to prolong the WSN
lifespan. Khorasani and Naji [18] used four algorithms, including radial basis data aggregation (RBDA), with
the aim of aggregating data, eliminating redundancy, and improving network accuracy and energy use.
Ullah and Youn [19] proposed a data aggregation system that reduces excessive and erroneous sensed data by
relying on data clusters; the radial basis function is applied in the cluster heads to decrease the instability of
the training process. The main contributions of this paper are as follows:
- Apply the self-organizing map (SOM), HAC, and radial basis function (RBF) algorithms in a WSN
environment, using the Intel Berkeley Research Lab dataset, to enhance WSN performance.
- Propose more than one method in the pre-processing stage, which led to high clustering results with the
SOM and HAC algorithms.
- Modify the output hidden layer structure of the RBF algorithm (extended RBF algorithm), which yielded
very high classification results.
- Describe the impact of the data aggregation strategy on WSNs, with a comprehensive and accurate review
of the literature that applied these algorithms.


2. RESEARCH METHOD 
Our proposed model consists of three main phases: data collection, pre-processing and analysis, and
implementation of algorithms. Figure 1 displays the block diagram of our proposed model.
Figure 1. The block diagram of our proposed model


2.1. Data collection

The Intel Berkeley Research Lab dataset is an open-source dataset collected at the Intel Berkeley
Research Lab. The capture period extends from February 28th to April 5th, 2004. As shown in Figure 2, the
dataset is composed of eight features (observational variables), and the captured sensed data items
(observational instances) total 2,313,682. The dataset was collected using 54 Mica2Dot sensors with
weather boards deployed in the lab. These boards measure the temperature, humidity, and light attributes
every 31 seconds [20].
Figure 2. Intel Berkeley Research Lab dataset


2.2. Pre-processing and analysis 

To learn the exact details and the total number of repeated measurements of each sensor, we counted
how often each sensor measured temperature, humidity, and light. This paves the way for the next steps and
for selecting the appropriate algorithms and functions that ultimately make the data aggregation strategy
succeed and thus reduce the excess data mentioned above. Figure 3 shows the measurement frequency of the
three phenomena for the 54 sensors (Motid).
Figure 3. Frequency measurement of 54 sensors
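The per-sensor frequency count described above can be sketched in a few lines. This is only an illustration, not the code used in the paper: the function name `readings_per_sensor` and the toy ID list are hypothetical, and it assumes the sensor identifier of every reading is available as a one-dimensional sequence.

```python
import numpy as np

def readings_per_sensor(mote_ids):
    """Count how many readings each sensor (mote) contributed."""
    ids, counts = np.unique(np.asarray(mote_ids), return_counts=True)
    return dict(zip(ids.tolist(), counts.tolist()))

# Toy example with three hypothetical sensor IDs (not taken from the dataset).
counts = readings_per_sensor([1, 1, 2, 3, 3, 3])
print(counts)
```

A large spread between the smallest and largest counts is the kind of imbalance that the balancing step in the next subsection is meant to address.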
2.2.1. Balanced distribution

Figure 3 shows that the sensor readings are uneven: the measurement load is very high on some
sensors and much lower on others. This imbalance negatively affects the performance of the algorithms used
in this paper, because they become biased toward the sensors that contribute the most data and therefore
produce inaccurate and limited results. Therefore, a function was proposed to randomly redistribute the
sensor measurements while keeping the measured values unmodified. Figure 4(a) shows the original dataset,
while Figure 4(b) shows the result of the balanced distribution method.


2.2.2. Normalization 

This step is necessary for data features that span a wide range of values, which would otherwise
bias the network weights toward the larger values. In this step, the numerical content of the measured dataset
is rescaled to a new range so that it can be fed easily into the network as input. As shown in Figure 4(c),
min-max normalization to the range [0.0, +0.5] was applied to the Intel Berkeley Research Lab dataset, as
given by (1).


$$D' = \frac{D - \mathrm{Min}_D}{\mathrm{Max}_D - \mathrm{Min}_D}\,(\mathrm{NewMax}_D - \mathrm{NewMin}_D) + \mathrm{NewMin}_D \qquad (1)$$
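As a minimal sketch of (1), the following NumPy function rescales each feature column to a target range. The function name `min_max_scale`, the guard for constant columns, and the toy readings are assumptions made for illustration; only the formula itself and the target range [0.0, 0.5] come from the text.

```python
import numpy as np

def min_max_scale(data, new_min=0.0, new_max=0.5):
    """Column-wise min-max normalization following equation (1)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    # Assumption: a constant column is mapped to new_min instead of dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (data - col_min) / span * (new_max - new_min) + new_min

# Toy rows of (temperature, humidity, light); values are illustrative only.
readings = np.array([[19.3, 44.5, 64.4],
                     [21.9, 42.2, 9.2],
                     [20.0, 37.1, 45.1]])
scaled = min_max_scale(readings)
print(scaled.min(axis=0), scaled.max(axis=0))
```

Because the minimum and maximum are taken over the whole column, a few extreme readings compress all other values into a narrow band, which is consistent with the small spread visible in Figure 4(c).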


(a) Original dataset
Row   1          2          3
1     122.1530   22.1719    1.2586e+03
2     19.3220    44.5482    64.4000
3     122.1530   -3.9190    104.8800
4     21.8798    42.1830    9.2000
5     122.1530   -3.9190    13.8000

(b) Balanced distribution
Row   1          2          3
1     122.1530   -3.9190    11.0400
2     19.9884    37.0933    45.0800
3     19.3024    38.4629    45.0800
4     19.1652    38.8039    45.0800
5     19.1750    38.8379    45.0800

(c) Normalization
Row   1          2          3
1     0.1893     0.4937     0.3406
2     0.0681     0.4949     0.0174
3     0.1893     0.4922     0.0284
4     0.0711     0.4948     0.0025
5     0.1893     0.4922     0.0037

Figure 4. Measurement of: (a) original dataset, (b) balanced distribution, and (c) normalization


2.3. Implementation of algorithms 

In this paper, the dataset collected in the data collection stage is inherently unlabeled. In machine
learning and artificial intelligence, one approach to labeling a given dataset is to use unsupervised learning
schemes. Supervised training of the RBF network requires a labeled dataset; because no labeled training
dataset is available, two powerful unsupervised learning algorithms, the SOM network and the HAC
algorithm, are used for this task.


2.3.1. Self-organization map algorithm 

SOM is one of the neurobiologically inspired neural networks and one of the most important
unsupervised learning algorithms. Its most basic form was proposed by the Finnish researcher Teuvo
Kohonen in 1982 [21], [22]. Two prevalent use cases for unsupervised learning are dimensionality
reduction and exploratory analysis. The most common and essential task of exploratory analysis is data
clustering, where the system learns the actual trend (or structure) of the data without being given explicit
labels; this is the main task of the SOM network in our work.

In our proposed model, the SOM network serves two goals: first, to infer the natural structure that
exists in the Intel lab dataset, and second, to provide the μ-parameters to the RBF network through a further
unsupervised clustering step. As a starting point, we use the SOM network to provide an initial exploratory
analysis of the sensed dataset collected in the data collection stage.


2.3.1.1. Applying self-organization map in WSNs 

The SOM network comprises two layers. The bottom layer holds the input features of a data
instance, and each input item is connected in parallel to all neurons of the second layer, which is arranged in
a two-dimensional lattice as illustrated in Figure 5.

Each neuron in this lattice (the grid is the so-called codebook) is a vector of weights (a code vector)
with the same dimension as the input training data items. The core idea of SOM is that each presentation of
an input vector adjusts the weight vectors of the winning node and of its topological neighborhood until they
closely resemble the input vector.
Figure 5. Self-organizing map learning process


Let us denote an input vector of the training subset taken from the pre-processed Intel lab dataset as
$x \in \mathbb{R}^3$ (temperature, humidity, and light features), and let each SOM neuron be associated with
a weight vector $w_k \in \mathbb{R}^3$, $k = 1, 2, \dots, K$. Each data presentation is referred to as $x(t)$,
where $t$ is the index of the current presentation and ranges from $t = 1$ to $t = P$, with $P$ the total
number of data instances in the training subset. Among the many variants of SOM, in this work we
implemented the variant called "batch learning SOM (BL-SOM)" [23]. In the on-line SOM, the weight vectors
$w_k$ of the SOM codebook are updated after each input presentation $x(t)$, whereas in BL-SOM the update
occurs at the end of each epoch (one epoch is a single pass over the entire dataset). For each presented input
vector $x(t)$, the Euclidean distance between $x(t)$ and each weight vector $w_k$ on the grid is computed as
in (2).


$$D_k(t) = \|x(t) - w_k(t)\|^2 \qquad (2)$$

Afterwards, as shown in Figure 5, the best matching unit (BMU, or winner neuron) is determined as in (3).

$$D_{bmu}(t) = \min_k\big(D_k(t)\big) \qquad (3)$$


As mentioned before, in batch SOM the weight update takes place at the end of each epoch. Let $t_0$ and
$t_f$ denote the start and the finish of an epoch; the weight update is then given by (4).

$$w_k(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{bmu,k}(t')\, x(t')}{\sum_{t'=t_0}^{t_f} h_{bmu,k}(t')} \qquad (4)$$

where:
$w_k(t_f)$: weight vector of the $k$-th SOM neuron computed at the end of the epoch, $t_f$.
$x(t)$: $t$-th input training vector.
$h_{bmu,k}(t)$: neighborhood function, which controls the extent to which $w_k$ is adjusted in response to an
input vector $x(t)$ that most closely resembles $w_{bmu}$. In our work, we used the standard Gaussian
function given in (5).

$$h_{bmu,k}(t) = \exp\left(-\frac{\|r_k - r_{bmu}\|^2}{\sigma^2(t)}\right) \qquad (5)$$

where:
$r_k$: coordinates of the $k$-th SOM neuron on the grid.
$r_{bmu}$: coordinates of the BMU neuron in response to the $t$-th input vector.
$\sigma(t)$: width of the neighborhood function, which decreases with $t$ from a pre-specified initial value
to a final value equal to the width of one neuron.
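To make the batch update in (2) to (5) concrete, the sketch below runs one BL-SOM epoch with NumPy. It is not the authors' implementation: the function name `batch_som_epoch`, the 3 x 3 grid, the random initialization, the linear decay of the width, and the exact scaling inside the Gaussian are all assumptions chosen for illustration.

```python
import numpy as np

def batch_som_epoch(X, W, grid, sigma):
    """One batch-SOM epoch: find every BMU, then update all weights at once.

    X: (P, d) inputs, W: (K, d) codebook, grid: (K, 2) neuron coordinates.
    """
    # Squared Euclidean distance of every input to every neuron, as in (2).
    D = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    bmu = D.argmin(axis=1)                                   # BMU index, as in (3)
    # Squared grid distance between each neuron and the BMU of each input.
    g = ((grid[None, :, :] - grid[bmu][:, None, :]) ** 2).sum(axis=2)
    H = np.exp(-g / sigma ** 2)                              # Gaussian neighborhood, as in (5)
    # Neighborhood-weighted mean of the inputs, as in (4).
    W_new = (H.T @ X) / np.maximum(H.sum(axis=0), 1e-12)[:, None]
    return W_new, bmu

rng = np.random.default_rng(0)
X = rng.random((200, 3)) * 0.5            # stand-in for normalized sensor readings
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
W = rng.random((9, 3)) * 0.5              # hypothetical random initial codebook
for sigma in np.linspace(2.0, 1.0, 10):   # width shrinks toward one neuron
    W, bmu = batch_som_epoch(X, W, grid, sigma)
```

Each updated weight is a positively weighted average of the inputs, so the codebook always stays inside the range of the training data, which is one practical reason the batch variant behaves stably without a learning rate.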


2.3.2. Hierarchical agglomerative clustering algorithm 

HAC is a mathematical clustering algorithm that is widely employed in modern machine learning
and data science to partition a database into a set of clusters and to produce a hierarchical relationship
among the data representations themselves [24]. Generally, hierarchical algorithms are categorized as
agglomerative or divisive. The first category starts with each object in its own cluster; the closest pair
of clusters is then merged (agglomerated) at every iteration until all clusters are combined into one large
cluster containing all objects (bottom-up approach). The second category, called divisive, takes the reverse
approach: it starts with one cluster containing all the data and then successively splits the appropriate
cluster [25]. The HAC algorithm produces a tree-like structure called a dendrogram, designed to provide
multiple high-level partitions of a database. This type of clustering helps create small clusters that provide
an informative data representation, ranging from non-redundant data sets up to the complete data set
belonging to a single cluster.


2.3.2.1. Applying HAC algorithm in WSNs 

Though the SOM network is powerful, a major problem is that it is often intractable to attain a
perfect or close-to-perfect mapping in which the SOM neurons are unique representations of the input data.
Moreover, in the proposed model, label precision must be considered in order to achieve highly accurate
data aggregation.

To address both issues, and to obtain better and more stable clustering performance in the WSN
operation stage, we propose a novel solution: we first use the pre-processed Intel lab dataset to generate the
trained SOM grid map, and then apply hierarchical agglomerative clustering [26] to the SOM neurons
themselves. Figure 6 illustrates the core idea behind the HAC algorithm.
Figure 6. Concept of hierarchical agglomerative clustering algorithm


In the first step, each weight vector on the SOM grid defines its own cluster. In our case, the SOM
grid has K neurons; therefore, we have K initial clusters. In each subsequent step, the hierarchical
agglomeration algorithm identifies the pair of clusters with the smallest mutual distance, using the distance
(linkage) metric $D_{average}$ described in (6), and merges them.


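$$D_{average}(\mathrm{cluster}^A, \mathrm{cluster}^B) = \frac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} \mathrm{dist}(x_{Ai}, x_{Bj}) \qquad (6)$$

where:
$n_A$: number of objects in cluster $A$.
$n_B$: number of objects in cluster $B$.
$\mathrm{dist}$: Euclidean distance between an object of cluster $A$ and an object of cluster $B$.
$x_{Ai}$: the $i$-th object in cluster $A$.
$x_{Bj}$: the $j$-th object in cluster $B$.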
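The merging procedure with the average linkage of (6) can be sketched as a naive loop over cluster pairs. The function names `average_linkage` and `hac_average`, the stopping rule (a fixed number of clusters), and the toy codebook are assumptions for illustration; the paper does not state how the dendrogram is cut.

```python
import numpy as np

def average_linkage(A, B):
    """Mean Euclidean distance over all pairs (one point from A, one from B), as in (6)."""
    diffs = A[:, None, :] - B[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).mean()

def hac_average(W, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain; return one label per row of W."""
    clusters = [[k] for k in range(len(W))]      # every code vector starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(W[clusters[a]], W[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # agglomerate the closest pair
        del clusters[b]
    labels = np.empty(len(W), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical SOM code vectors forming two obvious groups.
W = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
              [1.0, 1.0, 1.0], [1.1, 1.0, 1.0]])
print(hac_average(W, n_clusters=2))
```

Clustering the K code vectors rather than the raw readings keeps this quadratic search cheap, since K is far smaller than the number of data instances.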


2.3.3. Radial basis function algorithm 

Radial basis function networks were first formulated at the Royal Signals and Radar Establishment
in a paper published by Broomhead and Lowe [27]; however, this type of neural network was popularized by
Moody and Darken [28]. Concisely, a radial basis function neural network is a special type of feedforward
network architecture composed of only three layers: input, hidden, and output. This type of neural network
can provide a local representation of an N-dimensional input space by using localized, overlapping zones of
distance-based functions called network receptive fields. The core idea of the radial basis function neural
network resides in the theory of radial basis functions, which in turn belongs to a pure mathematical
discipline known as approximation theory. Figure 7 shows the regular radial basis function algorithm.
Figure 7. Regular radial basis function algorithm structure with Gaussian activation functions


2.3.3.1. Applying RBF algorithm in WSNs 

Typically, in terms of data aggregation, RBF is used to classify (interpolate) the most appropriate
representative value for a set of sensed data items. Mathematically, consider a multivariate function
$f: \mathbb{R}^d \to \mathbb{R}^m$. Without loss of generality, in our case the feature vectors are composed
of three features (temperature, humidity, and light), i.e., $\mathbb{R}^d$ with $d = 3$, and we expect a scalar
value as the output of the RBF, i.e., $\mathbb{R}^m$ with $m = 1$. However, in our proposed data aggregator
function, the RBF network classification model has to carry out a mapping from the continuous input space
of sensed data $X \in \mathbb{R}^d$ into a finite set of classes $Y = \{1, 2, 3, \dots, L\}$, where $L$ is the
number of quantization levels, which in turn represent the classes (labels) of the training data. Thus, we can
formulate the training data as in (7).

$$D_{tr} = \{(x^t, y^t) \mid x^t \in \mathbb{R}^3,\ y^t \in Y,\ t = 1, \dots, T\} \qquad (7)$$

In the recall phase (testing phase), further unlabeled observational sensed data
$D_{ts} = \{x^s \mid x^s \in \mathbb{R}^d,\ s = 1, \dots, S\}$ are presented to the trained RBF classification
model, and the RBF network estimates their class membership $y^s \in Y$ accordingly. In our RBF model, we
used the extended version of the RBF network, as elaborated in Figure 8.


[Figure 8 diagram: input feature vector x, hidden layer, output layer]
Figure 8. Extended radial basis function network structure with Gaussian activation functions
Here, the output layer is composed of a set of weighted summation nodes, one per class. For each input
feature vector, each output node receives a weighted sum of the activation values of all RBF neurons.

Each output node computes the "score" of the category it represents. The classification decision is
made by taking the maximum of the output scores: the output neuron with the maximum value is the one
that represents the category of the input feature vector. Based on the 1-of-L encoding version of the RBF
network shown in Figure 7, we have K basis functions that perform a mapping (approximation)
$F: \mathbb{R}^d \to \mathbb{R}^m$ as in (8).


$$F_l(x) = \sum_{k=1}^{K} w_{lk}\,\phi(\|x - \mu_k\|) + w_{l0}, \qquad l = 1, \dots, L, \qquad w_{l0} = b_l \qquad (8)$$

where $\|\cdot\|$ represents the Euclidean distance norm and $w_{lk}$ are real numbers used as weights
associated with the activation functions $\phi(\cdot)$. $w_{l0}$ refers to the biases, which can be absorbed into
the summation by introducing an extra basis function whose activation output is always set to 1, i.e.,
$\phi_0 = 1$. The output class is then determined as in (9).

$$\mathrm{class}(x) = \arg\max_{l} F_l(x), \qquad l = 1, \dots, L \qquad (9)$$
The mapping problem represented by (8) and (9) can be formulated in matrix form as in (10).

$$\Phi W = Y \qquad (10)$$

where $W = (w_1, w_2, \dots, w_L)$, each component being a vector of weights, $Y = (y^1, y^2, \dots, y^T)$,
and $\Phi$ is an $L \times L$ matrix defined as in (11).

$$\Phi = (\phi_{ij}) \qquad (11)$$

where $\phi_{ij} = \phi(\|x^i - \mu_j\|)$ and $\phi$ is a real-valued function that depends on the "radial" norm
distance between the input vector $x$ and the origin or a certain point; in our case, this point is the
corresponding peak center $\mu_j \in \mathbb{R}^d$ of the Gaussian function. Conventionally, the activation
functions of the radial basis function network are chosen as Gaussian functions, which are well known and
which we adopted in our work, as given in (12).

$$\phi(\|x - \mu_j\|) = \exp\left(-\frac{\|x - \mu_j\|^2}{2\sigma^2}\right) \qquad (12)$$


where $\sigma$ is the standard deviation around the peak center, which specifies the width of the Gaussian
function. The solution to (10), which represents the optimal weights, is given by (13).

$$W = \Phi^{-1} Y \qquad (13)$$

However, including the bias function in the $\Phi$ matrix makes it a non-square matrix, and therefore its
inverse $\Phi^{-1}$ cannot be obtained. To remedy this problem, a solution from linear algebra is used: the
pseudo-inverse replaces the inverse, as in (14).

$$W = (\Phi^T \Phi)^{-1} \Phi^T Y \qquad (14)$$

where $\Phi^T$ represents the transpose of the $\Phi$ matrix.
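Equations (8) to (14) can be put together in a short NumPy sketch: build the Gaussian design matrix with a constant bias column, encode the labels 1-of-L, solve for the weights with the pseudo-inverse, and classify by the largest output score. The function names, the shared width `sigma`, the two toy clusters, and the 0-based class labels are assumptions for illustration; `np.linalg.pinv` equals the expression in (14) when the design matrix has full column rank.

```python
import numpy as np

def design_matrix(X, centers, sigma):
    """Gaussian activations as in (12), plus a constant column phi_0 = 1 for the bias."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), Phi])

def fit_rbf(X, y, centers, sigma, n_classes):
    """Solve Phi W = Y for W with the pseudo-inverse, as in (14)."""
    Phi = design_matrix(X, centers, sigma)
    Y = np.eye(n_classes)[y]                # 1-of-L target encoding (labels 0..L-1 here)
    return np.linalg.pinv(Phi) @ Y

def predict_rbf(X, W, centers, sigma):
    """Pick the class whose output node has the largest score, as in (9)."""
    return (design_matrix(X, centers, sigma) @ W).argmax(axis=1)

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])  # hypothetical cluster centers
y = np.repeat([0, 1], 20)
X = centers[y] + rng.normal(scale=0.05, size=(40, 3))   # toy labeled readings
W = fit_rbf(X, y, centers, sigma=0.3, n_classes=2)
accuracy = (predict_rbf(X, W, centers, sigma=0.3) == y).mean()
print(accuracy)
```

In the pipeline described in this paper, the centers would come from the SOM and HAC stage and the labels from the resulting clusters; here both are synthetic.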


3. RESULTS AND DISCUSSION 
In this paper, the quantization error and the topographic error were measured to assess the quality
of the SOM algorithm.


3.1. Quantization error 

The quantization error measures the average distance between a data point and its assigned node;
the smaller the value, the better the fit [29]. The final quantization error after applying the SOM algorithm
is 0.005, as illustrated in Figure 9.
Figure 9. Quantization error


3.2. Topographic error 

The topographic error is a metric for how well the map preserves the spatial structure (topology) of
the data. For each input, the positions of the best matching neuron and of the second-best matching neuron
in the map are determined. If these two nodes are next to each other, the topology has been preserved for
this input; if not, it counts as an error. The topographic error of a map is the total number of such errors
divided by the total number of data points [29]. The final topographic error after applying the dataset was
0.105.
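Both quality measures can be computed directly from a trained codebook, as the following sketch shows. The function name `som_quality` and the toy one-row map are hypothetical, and "next to each other" is interpreted here as a grid distance of one step on a rectangular lattice; the paper does not state which neighborhood definition was used.

```python
import numpy as np

def som_quality(X, W, grid):
    """Return (quantization error, topographic error) for a trained SOM.

    X: (P, d) inputs, W: (K, d) codebook, grid: (K, 2) integer neuron coordinates.
    """
    D = np.sqrt(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2))
    order = D.argsort(axis=1)
    bmu, second = order[:, 0], order[:, 1]
    # Quantization error: mean distance between each input and its BMU.
    qe = D[np.arange(len(X)), bmu].mean()
    # Topographic error: share of inputs whose two best units are not adjacent
    # (assumption: adjacent means one step apart on a rectangular grid).
    steps = np.abs(grid[bmu] - grid[second]).sum(axis=1)
    te = (steps > 1).mean()
    return qe, te

# Toy map with three neurons in a row and two inputs.
W = np.array([[0.0], [1.0], [2.0]])
grid = np.array([[0, 0], [0, 1], [0, 2]])
X = np.array([[0.1], [1.9]])
qe, te = som_quality(X, W, grid)
print(qe, te)
```

A low quantization error with a high topographic error would indicate a codebook that fits the data well but is folded on the grid, which is why the two measures are reported together.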

In this paper, the RBF algorithm was used as the classification engine of our proposed model.
For the training stage, we used 84,000 samples as X-train (temperature, humidity, and light measurements)
and the 84,000 corresponding labels as Y-train to measure the training accuracy. Likewise, 6,000 samples as
X-test and the 6,000 corresponding labels as Y-test were used to measure the testing accuracy. The optimal
weights of the extended RBF algorithm were obtained in the training stage and then fed into the next stage.
Our proposed model achieved an overall classification accuracy of 97.54% for training and 97.70% for
testing. These high results indicate the efficiency of the modification to the output of the hidden layer of the
RBF algorithm. They encourage us to propose a new protocol that uses the remaining samples from the Intel
Berkeley Research Lab dataset to aggregate the data, with the aim of improving WSN performance in terms
of energy, accuracy, and latency.


4. CONCLUSION 

Data aggregation is one of the most important concerns in WSNs and must be considered in all
aspects of these networks. This paper presented a comprehensive description of three algorithms (SOM, HAC,
and RBF) for WSNs, starting from their structure, fundamental components, and arithmetic and ending with
their application to data aggregation for WSNs. The results obtained show that the proposed work can give
effective performance to WSNs after modifying the output of the hidden layer of the RBF algorithm. This
paper can serve as a guide for finding appropriate solutions to overcome or reduce data aggregation issues. In
future work, we plan to propose a new protocol that implements the three algorithms with the help of a data
aggregation strategy to obtain the best performance of WSNs in terms of energy, accuracy, and latency.